arXiv:1504.01497v2 [cs.DS] 10 Jul 2015


ReHub. Extending Hub Labels for Reverse k-Nearest Neighbor Queries on Large-Scale Networks


Alexandros Efentakis 
Research Center "Athena"
efentakis@imis.athena-innovation.gr


Dieter Pfoser 
George Mason University 
dpfoser@gmu.edu 


July 13, 2015 


Abstract 


Quite recently, the algorithmic community has focused on solving multiple shortest-path query
problems beyond simple vertex-to-vertex queries, especially in the context of road networks.
Unfortunately, this research cannot be generalized to large-scale graphs, e.g., social or collaboration
networks, or used to efficiently answer Reverse k-Nearest Neighbor (RkNN) queries, which are of
practical relevance to a wide range of applications. To remedy this, we propose ReHub, a novel
main-memory algorithm that extends the Hub Labeling technique to efficiently answer RkNN queries on
large-scale networks. Our experimentation will show that ReHub is the best overall solution for this
type of query, requiring only minimal preprocessing and providing very fast query times.

1 Introduction 

During the last two decades, the algorithmic community has produced significant results regarding
vertex-to-vertex shortest-path queries, especially in the context of transportation networks (cf. [?] for
the latest overview). Recently, this focus shifted to additional types of shortest-path (SP) queries,
such as one-to-all (finding SP distances from a source vertex s to all other graph vertices), one-to-many
(computing the SP distances between the source vertex s and all vertices of a set of targets T), range
(finding all nodes reachable from s within a given timespan), many-to-many (calculating a distance table
between two sets of vertices S and T) and kNN queries. Recent contributions here include [14] (one-to-all),
[17] (one-to-many, many-to-many), [20] (one-to-all, range, one-to-many) and [18, 21] (kNN queries).
Unfortunately, most of these methods target road networks and, thus, cannot easily be used for denser,
small-diameter graphs, such as social and collaboration networks.

In the case of large-scale networks, the prevailing technique for vertex-to-vertex shortest-path
queries is based on the 2-hop labeling, or Hub Labeling (HL), algorithm [22, 13]. During preprocessing,
we calculate for every vertex v a forward label $L_f(v)$ and a backward label $L_b(v)$. These labels are
subsequently used to answer vertex-to-vertex shortest-path queries very quickly. The HL technique has
been adapted successfully to road networks [2, 3, 16, 1] and quite recently has also been extended
to undirected, unweighted graphs [5, 15, 24]. The HL method has also been used for one-to-many,
many-to-many and kNN queries in road networks [17, 1].

Another very important type of query is the Reverse k-Nearest Neighbor (RkNN) problem, initially
proposed in [25]. Given a query point q and a set of objects P, the RkNN query retrieves all the objects
in P that have q as one of their k-nearest neighbors according to a distance function dist(). In Euclidean
space, the distance dist(s, t) refers to the Euclidean distance between two objects s and t. For graphs,
dist(s, t) corresponds to the minimum network distance between the two objects. RkNN queries may be
used in various domains, ranging from geomarketing to location-based services, and in a wide range of
applications, including resource allocation, profile-based marketing and decision support [?]. Despite
their importance and the fact that there is some scientific literature discussing RkNN queries for road
networks [30, 11, 10], to the best of our knowledge, the only RkNN work focusing on other types
of graphs is [32]. Unfortunately, all those previous works share some inherent limitations: they
assume that the graph does not fit in main memory (and therefore is stored on secondary storage),
they require query times of a few seconds, which prohibits their use in real-time applications, and, most
importantly, they do not scale particularly well with respect to the network size, the object density, the
distribution of objects and the cardinality of the reverse k-nearest neighbor result.

Putting everything together, the ambition of this work is to provide an efficient and fast
main-memory algorithm for answering RkNN queries on large-scale graphs. Our proposed algorithm,
termed ReHub (Reverse kNN + Hub labels), extends the Hub Labeling approach to efficiently handle
these queries. The main advantage of ReHub is that its slower offline phase depends only on the location
of the objects P and has to run only once, whereas its online phase (which depends on the query vertex
q) is very fast. Still, even the costlier offline phase hardly needs more than 1s, while the online phase
typically requires less than 1ms, making ReHub the only RkNN algorithm fast enough for real-time
applications on large-scale graphs. Moreover, the data structures needed for answering RkNN
queries may also answer kNN queries and require only a small fraction of the memory required for
storing the hub labels created for the typical case of vertex-to-vertex queries. Throughout this work, we
use undirected and unweighted graphs, which constitute an important graph class (containing social
and collaboration networks) but also pose a significant challenge to Hub Labeling algorithms because
of the sheer size of the created labels. However, our method could be easily adapted to other graph
classes where the hub-labeling algorithm typically performs well, including road networks.

The outline of this work is as follows. Section 2 presents related work. Section 3 describes the ReHub
algorithm and provides a theoretical analysis of its performance. Experiments showcasing ReHub's
benefits are provided in Section 4. Finally, Section 5 gives conclusions and directions for future work.

2 Related work 

Given a query point q and a set of objects P, the RkNN query (also referred to as the monochromatic
RkNN query) retrieves all the objects that have q as one of their k-nearest neighbors, according to a
distance function dist(). Formally, $RkNN(q) = \{p \in P : dist(p, q) \le dist(p, p_k)\}$, where $p_k$ is the
k-Nearest Neighbor (kNN) of p. In Euclidean space, the distance dist(s, t) refers to the Euclidean distance
between two objects s and t. In graph networks, dist(s, t) corresponds to the minimum network distance
between the two objects. Throughout this work we use undirected, unweighted graphs G(V, E) (where
V represents the vertices and E the arcs), we assume that objects are located on vertices and we refer
to snapshot RkNN queries, i.e., objects are not moving. Also, similar to previous works, the term object
density D refers to the ratio |P|/|V|.
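
To make the definition concrete, the following brute-force sketch evaluates it directly with one breadth-first search per object. It is only a reference baseline for small inputs, not the ReHub method. The edge list is an assumption: a small 14-vertex tree chosen for illustration, consistent with the sample hub labels used later in Section 3.1. The names `bfs_dist` and `rknn_bruteforce` are hypothetical.

```python
from collections import deque

# Assumed sample tree (14 vertices), chosen for illustration only.
EDGES = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 5), (1, 6), (1, 7),
         (2, 8), (3, 9), (4, 10), (5, 11), (6, 12), (7, 13)]


def bfs_dist(adj, src):
    """Unweighted single-source distances by breadth-first search."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def rknn_bruteforce(edges, objects, q, k):
    """Return {p: dist(p, q)} for every object p with dist(p, q) <= dist(p, p_k).

    Assumes a connected, undirected, unweighted graph.
    """
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    result = {}
    for p in objects:
        dist = bfs_dist(adj, p)
        # Distances from p to the other objects, ascending; p itself is excluded.
        others = sorted(dist[o] for o in objects if o != p)
        # With fewer than k other objects, q trivially qualifies.
        kth = others[k - 1] if len(others) >= k else float("inf")
        if dist[q] <= kth:
            result[p] = dist[q]
    return result


print(rknn_bruteforce(EDGES, [4, 10, 12], q=0, k=1))
```

This baseline costs one graph traversal per object for every query, which is exactly the per-query work that the offline/online split described in Section 3 is designed to avoid.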

There is extensive literature focusing on RkNN queries in Euclidean space. Since our work focuses
on graphs, we only discuss graph-based methods here. Regarding road networks, the work of [30] uses
Network Voronoi cells (i.e., the set of vertices and arcs that are closer to the generator object) to answer
RkNN queries. This work has only been tested on a relatively small network (110K arcs) and all precomputed
information is stored in a database. Despite its costly preprocessing (for calculating the Network
Voronoi cells), queries still require 1.5s for D = 0.05 and k = [?]. The query times further increase to
32s for k = [?]. Later works focusing on continuous RkNN queries on road networks [11] have only been
tested with even smaller road networks (22K arcs) and are different in scope from our work, which
focuses on snapshot RkNN queries. To the best of our knowledge, the only work focusing on other
graph classes (besides road networks) is [32]. This work, too, has only been tested on sparse networks,
e.g., road networks, grid networks (max degree 10), p2p graphs (avg degree 4) and a very small, sparse
co-authorship graph (4K nodes). Furthermore, all experimentation there for values of k > 1 (up to k = 8)
refers to road networks, so the scalability of the proposed algorithms for denser graphs and larger
values of k is debatable. Interestingly enough, this work proposed the Eager [32] algorithm that, similar
to ReHub, has an offline and an online phase (the latter using the precomputed information obtained from
the offline phase) to accelerate RkNN queries. Unfortunately, both phases are unoptimized. The offline
phase uses a slow, combined network expansion from all objects, which cannot scale for very dense
graphs or sparse objects. The faster online phase performs a pruned Dijkstra-like expansion from the
query vertex and thus will also be too slow for denser graphs and small values of D. Recently, Borutta
et al. [10] extended this work for time-dependent road networks, but the presented results were also
not encouraging. The largest road network tested had 50K vertices (queries require more than 1s for
k = 1), and for a road network of 10K nodes and k = [?], RkNN queries take more than 0.3s (without
even adding the I/O cost). In a nutshell, existing methods have not been tested
on dense, large-scale graphs, do not scale well for increasing k values, and their performance highly
depends on the object density D.

Our work builds on the 2-hop labeling or Hub Labeling (HL) algorithm of [22, 13] in which, during
preprocessing, we store at every vertex v a forward label $L_f(v)$ and a backward label $L_b(v)$. The forward
label $L_f(v)$ is a sequence of pairs (u, dist(v, u)), with $u \in V$. Likewise, the backward label $L_b(v)$ contains
pairs (w, dist(w, v)). Vertices u and w denote the hubs of v. The generated labels conform to the cover property,
i.e., for any s and t, the set $L_f(s) \cap L_b(t)$ must contain at least one hub that is on the shortest s-t path.
For undirected graphs, $L_b(v) = L_f(v)$. To find the network distance dist(s, t) between two vertices s and
t, an HL query must find the hub $v \in L_f(s) \cap L_b(t)$ that minimizes the sum dist(s, v) + dist(v, t). Since the
pairs in each label are sorted by hub, this takes linear time by employing a coordinated sweep over
both labels. The HL technique has been successfully used for road networks in [2, 3, 16, 1]. In the
case of large-scale graphs, the Pruned Landmark Labeling (PLL) algorithm of [5] "produces a minimal
labeling for a specified vertex ordering" [15]. Although this work orders vertices by degree, the later work
of [23] improves the suggested vertex ordering and the storage schema of the hub labels for maximum
compression. On a similar note, Jiang et al. [24] propose their HopDB algorithm to provide an efficient
HL index construction when the given graphs and the corresponding index are too big to fit into main
memory. The HL method has also been used for one-to-many, many-to-many and kNN queries on
road networks in [17] and [1] respectively. The core contribution of our work is to extend existing HL
techniques to RkNN queries on large-scale graphs through the proposed ReHub algorithm,
presented in the following section.
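
The coordinated sweep behind an HL query can be sketched as follows. This is a minimal illustration, assuming each label is a Python list of (hub, dist) pairs sorted by hub; the function name `hl_distance` is hypothetical, and the three labels are those of vertices 4, 10 and 12 in the sample labeling of Table 1 (Section 3.1).

```python
def hl_distance(label_s, label_t):
    """Coordinated sweep over two hub-sorted labels.

    Returns the minimum of dist(s, h) + dist(h, t) over all common hubs h,
    or infinity if the labels share no hub.
    """
    i = j = 0
    best = float("inf")
    while i < len(label_s) and j < len(label_t):
        hub_s, d_s = label_s[i]
        hub_t, d_t = label_t[j]
        if hub_s == hub_t:
            best = min(best, d_s + d_t)
            i += 1
            j += 1
        elif hub_s < hub_t:
            i += 1
        else:
            j += 1
    return best


# Undirected graph, so the backward label equals the forward label.
L4 = [(0, 1), (4, 0)]
L10 = [(0, 2), (4, 1), (10, 0)]
L12 = [(0, 3), (1, 2), (6, 1), (12, 0)]
print(hl_distance(L4, L10), hl_distance(L4, L12), hl_distance(L10, L12))
```

Each label is scanned at most once, so the query cost is linear in the combined label size, which is why the average label size |HL|/|V| dominates the later complexity analysis.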

3 The ReHub algorithm 

What follows is the description of the ReHub (Reverse kNN + Hub labels) algorithm that extends the
Hub Labeling approach to efficiently handle RkNN queries on large-scale graphs. ReHub consists of
two distinct, independent phases: (i) a slower, costlier offline phase that takes place after the creation
of the hub labels and depends only on the objects P (regardless of the query vertex q), and (ii) an online
phase that uses the auxiliary data structures created during the offline phase to compute the actual
RkNN query results. The main benefit of the ReHub algorithm is that the costlier offline phase has to
run only once and may service all RkNN queries for a specific set of objects, whereas the online phase
(which actually depends on the query vertex q) is very fast (typically less than 1ms). Hence, ReHub may
be used within the context of real-time applications, operating on large-scale graphs.

3.1 Offline Phase

The offline phase of the ReHub algorithm takes place after the creation of the
hub labels. Although the ReHub algorithm works with any correct Hub Labeling algorithm, in this
work we generate the necessary labels using the PLL algorithm of [5], as provided by its authors
in [6]. To highlight the results of the PLL algorithm, the generated labels for the sample undirected,
unweighted graph G of Figure 1 are shown in Table 1. In the remainder of this work we will refer to
those labels as the forward labels. We also assume that the target objects are located at vertices 4, 10, 12,
i.e., P = {4, 10, 12}. The respective entries are marked with * in Table 1. For each vertex v, the forward
label L(v) is an array of pairs (u, dist(v, u)) sorted by hub vertex u. This is the starting point for the offline
phase of the ReHub algorithm, which in turn is divided into three smaller substages: (i) the kNN backward
labels construction, (ii) the batch kNN calculations from all objects, and (iii) the RkNN backward labels
construction. Each of these stages is described in the following.

Vertex  Hub Labels (h,d)
------  ---------------------------
0       (0,0)
1       (0,1), (1,0)
2       (0,1), (2,0)
3       (0,1), (3,0)
4*      (0,1), (4,0)
5       (0,2), (1,1), (5,0)
6       (0,2), (1,1), (6,0)
7       (0,2), (1,1), (7,0)
8       (0,2), (2,1), (8,0)
9       (0,2), (3,1), (9,0)
10*     (0,2), (4,1), (10,0)
11      (0,3), (1,2), (5,1), (11,0)
12*     (0,3), (1,2), (6,1), (12,0)
13      (0,3), (1,2), (7,1), (13,0)

Figure 1 & Table 1: A sample graph G and the created hub labels (* marks the objects in P)

3.1.1 The kNN backward labels construction

To efficiently answer one-to-many queries with hub
labels, we need to store separately the hub labels of the target objects $P = \{P_1, \ldots, P_i, \ldots, P_{|P|}\}$, ordered by
hub u. For each such hub u, the backward labels-to-many are an array of pairs $(P_i, d(u, P_i))$. Expanding
this approach to kNN queries, [1] showed that if we know the number k in advance (or the maximum
k we will service for kNN queries), then for each hub we only need to keep the k pairs with the smallest
distances. Although these previous works focused on road networks, the correctness of this
approach still applies to undirected, unweighted graphs. This process for the sample graph and k = 2
is shown in Table 2. For small-diameter graphs (like the ones used in this work) we will have many
ties (in terms of distance), but keeping at most k labels still ensures correctness.


















Table 2: The ReHub kNN backward labels creation for the sample graph G, k = 1 and P = {4, 10, 12}

Hub  Backward Labels (to-many)  kNN Backward Labels (k=2)  ReHub kNN Backward Labels (k=1)
---  -------------------------  -------------------------  -------------------------------
0    (4,1), (10,2), (12,3)      (4,1), (10,2)              (0,1), (1,2)
1    (12,2)                     (12,2)                     (2,2)
4    (4,0), (10,1)              (4,0), (10,1)              (0,0), (1,1)
6    (12,1)                     (12,1)                     (2,1)
10   (10,0)                     (10,0)                     (1,0)
12   (12,0)                     (12,0)                     (2,0)


KnnLab(P, |P|, k, forwLabels, kNNLab)
    Initialize(kNNLab, (|V|, BoundPQue(k + 1)))
    for i = 0 to |P| - 1
        for j = 0 to forwLabels[P[i]].size - 1
            hub = forwLabels[P[i]][j].hub
            d = forwLabels[P[i]][j].dist
            kNNLab[hub].push(i, d)

Due to the pruning of the PLL algorithm, in our example, the kNN backward labels do not necessarily
have as many as k pairs per hub. To create the kNN backward labels for ReHub, we need some
additional modifications. (i) When answering RkNN queries, we must use k + 1 instead of k during
the construction of the kNN backward labels. This is necessary, since in our example a plain kNN query
from object 10 (for k = 1) returns by definition object 10 itself, whereas for RkNN queries with k = 1
the nearest neighbor of 10 must be another object, namely object 4. (ii) Instead of storing the vertex
IDs $P_i$ of the objects in the kNN backward labels, we
store the array index i of each object, as shown in the last column of Table 2. This facilitates faster
processing during the remaining substages of the offline and online phases of the ReHub algorithm. On
the technical side, the kNN backward labels creation is quite fast, since we only have to loop through
the forward labels of the objects in P and use a vector-based bounded priority queue of size k + 1 per
hub to calculate the k + 1 pairs with the smallest distances per hub. This method offers two major
advantages: (i) we do not need to build the intermediate backward labels-to-many data structure
(column 2, Table 2), which would be much slower, and (ii) when looping through the forward labels
of each object, pairs with distances greater than the (k + 1)-th smallest distance previously found for a
specific hub may be safely ignored.

The pseudocode for the kNN backward labels construction is shown in procedure KnnLab.
Throughout this process, for each hub we use a vector-based bounded priority queue of size k + 1 that
stores pairs in the form (idx, dist) ordered by distance.
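
The construction can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' C++ implementation: the bounded priority queue is swapped for a max-heap built on the standard `heapq` module, labels are plain lists of (hub, dist) pairs, and the function name `build_knn_backward_labels` is hypothetical. The input is the forward labels of the three sample objects from Table 1.

```python
import heapq

# Forward labels of the sample objects P = [4, 10, 12] (taken from Table 1).
FORW = {
    4: [(0, 1), (4, 0)],
    10: [(0, 2), (4, 1), (10, 0)],
    12: [(0, 3), (1, 2), (6, 1), (12, 0)],
}


def build_knn_backward_labels(objects, k, forw_labels):
    """For each hub keep the k + 1 closest (idx, dist) pairs, sorted by distance."""
    heaps = {}  # hub -> max-heap on distance (stored negated), at most k + 1 pairs
    for idx, p in enumerate(objects):
        for hub, d in forw_labels[p]:
            heap = heaps.setdefault(hub, [])
            if len(heap) < k + 1:
                heapq.heappush(heap, (-d, idx))
            elif d < -heap[0][0]:
                # Strictly better than the current worst pair: replace it.
                heapq.heapreplace(heap, (-d, idx))
            # Otherwise the pair cannot be among the k + 1 best and is ignored.
    return {
        hub: sorted(((idx, -neg_d) for neg_d, idx in heap),
                    key=lambda pair: (pair[1], pair[0]))
        for hub, heap in heaps.items()
    }


KNN_LAB = build_knn_backward_labels([4, 10, 12], 1, FORW)
for hub in sorted(KNN_LAB):
    print(hub, KNN_LAB[hub])
```

Storing the object index rather than the vertex ID means that later stages can address per-object arrays of size |P| directly.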

3.1.2 Batch kNN calculations from objects

After creating the kNN backward labels (column 4,
Table 2), we need to calculate the k-nearest neighbors of each object. To this end, we perform a total
of |P| kNN calculations, using the created kNN backward labels. Each of those kNN computations
uses the method implicitly described in [1], with the additional constraint that, for each object, when
traversing the kNN backward labels of one of its hubs, we skip the labels corresponding to this specific
object index.












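Table 3: Batch kNN calculations process for the sample graph G, k = 1 and P = {4, 10, 12}

Obj. ID  Forward Labels of Objects    Hub  ReHub kNN Backward Labels (k=1)  kNN Results (idx, dist)
-------  ---------------------------  ---  -------------------------------  -----------------------
4        (0,1), (4,0)                 0    (0,1), (1,2)                     (1,1)
                                      1    (2,2)
10       (0,2), (4,1), (10,0)         4    (0,0), (1,1)                     (0,1)
                                      6    (2,1)
12       (0,3), (1,2), (6,1), (12,0)  10   (1,0)                            (0,4)
                                      12   (2,0)

Note: the object columns (1, 2, 5) and the hub columns (3, 4) are laid out side by side; each object is
combined with the rows of the hubs that appear in its own forward label.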


BatchKnnCalc(P, |P|, k, forwLabels, kNNLab, kNNResults)
1   Initialize(kNNResults, (|P|, BoundPQue(k)))
2   parallel for i = 0 to |P| - 1
3       for j = 0 to forwLabels[P[i]].size - 1
4           hub = forwLabels[P[i]][j].hub
5           d = forwLabels[P[i]][j].dist
6           for m = 0 to kNNLab[hub].size - 1
7               idx = kNNLab[hub][m].idx
8               if idx != i
9                   d1 = d + kNNLab[hub][m].dist
10                  kNNResults[i].push(idx, d1)

The simplified pseudocode for the batch kNN calculations from objects is shown in procedure
BatchKnnCalc. The kNNResults are also stored in a |P|-sized vector of vector-based bounded
priority queues of size k that store pairs in the form (idx, dist) ordered by distance. For each object,
when traversing the kNN backward labels of one of its hubs, we skip the pairs corresponding to the
index of this specific object (Line 8 in the pseudocode). Moreover, every time a new pair is pushed
to the corresponding queue (Line 10), our customized push operation checks if the pushed object
index already exists in the queue with a distance smaller than or equal to that of the pushed pair. If yes,
we can safely ignore this pair. If, on the other hand, this object index exists in the queue with a larger
distance value, we update this distance value and re-sort the queue. If the pushed object index does not
already exist in the queue, our custom push operation checks if the queue has fewer than k items. In that
case, the new pair enters the queue and the queue is re-sorted. If the queue already has k items, our
push operation checks if the new pair is better (i.e., corresponds to a smaller distance) than the last (k-th)
element of the queue. If yes, the last element is popped, the new pair enters the queue at the end and
the queue is re-sorted. Since each queue is basically a vector of size k, popping back, pushing back and
re-sorting this (rather small) priority queue are very fast operations.

We can further accelerate the process if, every time a new pair (idx, d1) enters the kNNResults[i]
queue for a specific object, we check whether the queue already has k items; in that case we store the
worst label distance as a separate variable. If the distance d (Line 5) or the distance d1 (Line 9) is greater
than this worst distance, we can safely skip this particular pair. In the second case in particular (distance
d1, Line 9), we can exit the third loop (Line 6) completely, since the kNN backward label of each hub
is ordered by distance. This optimization (not shown in the pseudocode for readability) significantly
accelerates each individual kNN calculation.
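
A compact Python sketch of the batch kNN stage is given below. It deliberately simplifies the procedure: instead of the custom bounded queue it keeps the best distance per object index in a dictionary and sorts once at the end, it runs sequentially rather than in parallel, and it omits the early-exit optimization. The function name `batch_knn` is hypothetical; the inputs are the sample forward labels (Table 1) and the ReHub kNN backward labels (last column of Table 2).

```python
FORW = {
    4: [(0, 1), (4, 0)],
    10: [(0, 2), (4, 1), (10, 0)],
    12: [(0, 3), (1, 2), (6, 1), (12, 0)],
}
# ReHub kNN backward labels for k = 1: hub -> [(object index, dist), ...]
KNN_LAB = {0: [(0, 1), (1, 2)], 1: [(2, 2)], 4: [(0, 0), (1, 1)],
           6: [(2, 1)], 10: [(1, 0)], 12: [(2, 0)]}


def batch_knn(objects, k, forw_labels, knn_lab):
    """kNN of every object (itself excluded) as lists of (idx, dist) sorted by dist."""
    results = []
    for i, p in enumerate(objects):
        best = {}  # object index -> smallest distance found so far
        for hub, d in forw_labels[p]:
            for idx, d_hub in knn_lab.get(hub, []):
                if idx == i:
                    continue  # skip the labels of the object itself
                d1 = d + d_hub
                if d1 < best.get(idx, float("inf")):
                    best[idx] = d1
        ranked = sorted(best.items(), key=lambda pair: (pair[1], pair[0]))
        results.append(ranked[:k])
    return results


print(batch_knn([4, 10, 12], 1, FORW, KNN_LAB))
```

Because the iterations of the outer loop share no mutable state, they could be distributed over worker threads or processes, mirroring the parallel loop of the pseudocode.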

The results of this process are shown in Table 3, where the combination of the forward labels of
the objects {4, 10, 12} with the kNN backward labels shows that the kNN of object 4 is the object with
index 1, i.e., object 10, with distance 1. The kNN of object 10 is the object with index 0 (object 4) with
distance 1 and, finally, the kNN of object 12 is the object with index 0 (object 4) with
distance 4. To facilitate faster computation, the kNN computations may be performed in
parallel (Line 2 of procedure BatchKnnCalc), since there is no interaction between the individual
kNN calculations. Considering that this is the slowest substage of the offline phase, employing parallelism
significantly reduces the total preprocessing time required for ReHub's offline phase.

Table 4: RkNN backward labels construction for the sample graph G, k = 1 and P = {4, 10, 12}

Obj. ID  kNN Result (idx, dist)  Forward Labels of Objects    Hub  RkNN Backward Labels (k=1)
-------  ----------------------  ---------------------------  ---  --------------------------
4        (1,1)                   (0,1), (4,0)                 0    (0,1), (2,3)
                                                              1    (2,2)
10       (0,1)                   (0,2), (4,1), (10,0)         4    (0,0), (1,1)
                                                              6    (2,1)
12       (0,4)                   (0,3), (1,2), (6,1), (12,0)  10   (1,0)
                                                              12   (2,0)

3.1.3 The RkNN backward labels construction

After calculating the kNN of each object, for answering
RkNN queries it would suffice to run a one-to-many HL query from the query vertex q to all objects,
by constructing and using the backward labels-to-many of objects P (see column 2, Table 2), and then loop
through the calculated distances to see if they are smaller than or equal to the kNN distances calculated
by the previous step. But we can do much better: we construct an alternative data structure, referred
to hereafter as the RkNN backward labels, based on the observation that we need to calculate distances to a
specific object if and only if those distances are smaller than or equal to the distance of the kNN of this
object. If the objects are uniformly distributed through the graph, this optimization ensures that only hubs at
relatively small distances from each object are added to the RkNN backward labels. Therefore, during
the online phase, if the query vertex q is far away from some objects, there will be no matching hubs
between those objects and the query vertex.

The resulting pseudocode for the RkNN backward labels construction is shown in procedure
RknnLab and the entire process is highlighted in Table 4. When we build the RkNN labels
for object 10, we skip the pair (0,2) because the NN of object 10 is within a distance of 1 and therefore
pairs with greater distances than that (for this particular object) may be safely ignored. Again, when
building the RkNN backward labels we use the objects' array indexes instead of their IDs.

RknnLab(P, |P|, k, forwLabels, kNNResults, RkNNLab)
    Initialize(RkNNLab, (|V|, vector<(idx, dist)>))
    for i = 0 to |P| - 1
        for j = 0 to forwLabels[P[i]].size - 1
            d = forwLabels[P[i]][j].dist
            if d <= kNNResults[i][k - 1].dist
                hub = forwLabels[P[i]][j].hub
                RkNNLab[hub].push_back(i, d)
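
The same construction can be written as a short Python sketch. It assumes that every object already has exactly k kNN results (as produced by the previous stage) and uses dictionaries of lists in place of the vectors of the pseudocode; the function name `build_rknn_backward_labels` is hypothetical. The inputs are the sample forward labels of Table 1 and the kNN results of Table 3.

```python
FORW = {
    4: [(0, 1), (4, 0)],
    10: [(0, 2), (4, 1), (10, 0)],
    12: [(0, 3), (1, 2), (6, 1), (12, 0)],
}
# kNN result of each object (by array index) for k = 1: [(idx, dist)]
KNN_RESULTS = [[(1, 1)], [(0, 1)], [(0, 4)]]


def build_rknn_backward_labels(objects, k, forw_labels, knn_results):
    """Regroup object labels by hub, dropping pairs farther than the object's k-th NN."""
    rknn_lab = {}
    for i, p in enumerate(objects):
        kth_dist = knn_results[i][k - 1][1]
        for hub, d in forw_labels[p]:
            if d <= kth_dist:
                rknn_lab.setdefault(hub, []).append((i, d))
    return rknn_lab


RKNN_LAB = build_rknn_backward_labels([4, 10, 12], 1, FORW, KNN_RESULTS)
for hub in sorted(RKNN_LAB):
    print(hub, RKNN_LAB[hub])
```

On the sample input the pair (0,2) of object 10 is dropped, since it exceeds that object's nearest-neighbor distance of 1.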

Several interesting observations can be made by comparing Tables 2 and 4. First, as expected,
the number of RkNN backward labels (column 5, Table 4) is smaller than the number of backward
labels-to-many (column 2, Table 2). Although for our small sample graph G this difference is minimal,
for larger graphs it becomes significant. Therefore, using the RkNN backward labels will significantly
improve the online phase of the ReHub algorithm. This will be clearly showcased in our experimentation
presented in Section 4. Second, the kNN backward labels (column 4, Table 2) are different from the RkNN
backward labels (column 5, Table 4). The added benefit is that by using the kNN backward labels we can still
answer kNN queries and by using the RkNN backward labels we can answer RkNN queries within the
same framework.

3.2 Online Phase

The offline phase of the ReHub algorithm runs only once for a specific set of
objects P. Its final output is (i) a matrix of size $|P| \times k$ of pairs (idx, dist), ordered by distance per row,
that contains the kNN of each object and (ii) the RkNN backward labels. The following online phase of the
ReHub algorithm is basically a modified one-to-many HL query from the query vertex q that operates
on the RkNN backward labels and is described by the pseudocode of procedure OnlinePhase. The
output of the online phase is a vector (denoted out in the pseudocode) of size |P| with all values set
to infinity, except those at the indexes of the objects of the RkNN set; those values are set
to the correct distances from query vertex q to the respective objects. In our running example of the
sample graph G, with P = {4, 10, 12} and k = 1, the online phase for an RkNN query from vertex 0 would
only have to visit the RkNN backward labels of hub 0 (see Tables 1 and 4) and would output the result
out = {1, ∞, 3}, meaning that the objects 4 and 12 belong to the RkNN set of vertex 0 with distances 1 and 3
respectively.

Theorem 3.1. The ReHub algorithm is correct.

Proof. Building the kNN backward labels and then performing the batch kNN calculations to calculate
the kNN of each object is correct, because it follows the methodology of Abraham et al. [1], who proved
its correctness. Building the RkNN backward labels is also correct, since we just reorder all labels of the
objects according to hub, except those that correspond to distances greater than the kNN distance of their
object. This ensures that we can calculate correct distances to any of those objects from any query vertex,
except when this query vertex is farther than the kNN of a specific object. The online phase is also
correct, since it operates on the RkNN backward labels and updates the result vector out for a specified
object only when the calculated distance is smaller than or equal to the distance of the kNN of this object
(Line 8, procedure OnlinePhase). Therefore the ReHub algorithm is also correct.

OnlinePhase(q, P, |P|, k, forwLabels, kNNResults, RkNNLab, out)
1   Initialize(out, (|P|, ∞))
2   for i = 0 to forwLabels[q].size - 1
3       hub = forwLabels[q][i].hub
4       d = forwLabels[q][i].dist
5       for j = 0 to RkNNLab[hub].size - 1
6           idx = RkNNLab[hub][j].idx
7           d2 = d + RkNNLab[hub][j].dist
8           if d2 < out[idx] and d2 <= kNNResults[idx][k - 1].dist
9               out[idx] = d2
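
The online phase translates almost line by line into Python. The sketch below is illustrative only: the function name `online_phase` and the dictionary-based label layout are assumptions, and the inputs are the kNN results of Table 3, the RkNN backward labels of Table 4 and the forward labels of query vertices 0 and 5 from Table 1.

```python
import math

# Offline-phase output for the sample graph, k = 1 and P = [4, 10, 12].
KNN_RESULTS = [[(1, 1)], [(0, 1)], [(0, 4)]]
RKNN_LAB = {0: [(0, 1), (2, 3)], 1: [(2, 2)], 4: [(0, 0), (1, 1)],
            6: [(2, 1)], 10: [(1, 0)], 12: [(2, 0)]}


def online_phase(q_label, num_objects, k, knn_results, rknn_lab):
    """Return out[idx] = dist(q, object idx) if the object is in RkNN(q), else inf."""
    out = [math.inf] * num_objects
    for hub, d in q_label:
        for idx, d_hub in rknn_lab.get(hub, []):
            d2 = d + d_hub
            # Keep d2 only if it improves out[idx] and does not exceed the
            # distance of the k-th nearest neighbor of object idx.
            if d2 < out[idx] and d2 <= knn_results[idx][k - 1][1]:
                out[idx] = d2
    return out


print(online_phase([(0, 0)], 3, 1, KNN_RESULTS, RKNN_LAB))                  # query vertex 0
print(online_phase([(0, 2), (1, 1), (5, 0)], 3, 1, KNN_RESULTS, RKNN_LAB))  # query vertex 5
```

Only the hubs in the label of q are visited, so a query far from every object touches few or no RkNN backward labels.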

The main advantage of the ReHub algorithm, in comparison to previous works, is the separation
between the costlier offline phase, which runs only once for a specific set of objects, and the very fast
online phase. An additional benefit of ReHub compared to the works of [33, 10] is that ReHub not only
calculates the RkNN set of the query vertex but also calculates the correct network distances from the
query vertex to any of the objects belonging to the RkNN set. Regarding the online phase, operating on the
RkNN backward labels is significantly faster, since for large graphs those RkNN backward labels are
significantly fewer than the backward labels-to-many. Also, the usage of object array indexes instead
of the object IDs accelerates the whole process, since the final results vector out is of size |P| instead of
|V|, which makes its initialization faster (Line 1, procedure OnlinePhase). Also, accessing the kNN
results of each object (Line 8) and the previous best value of the results vector (Lines 8 and 9) are very cheap
operations, since they operate on smaller vectors of size |P|. Moreover, the memory required for storing
these intermediate data structures is also significantly smaller. This will be further quantified in the
next section, where we analyze the complexity and memory requirements of the ReHub algorithm.

3.3 Complexity Analysis and Memory Requirements

If D is the object density defined as $D = |P|/|V|$,
then the number of objects is $D \cdot |V|$. The forward label of each vertex has an average of $|HL|/|V|$ hubs, where
|HL| is the total number of labels created by the hub-labeling algorithm (PLL in our case). Especially in
the case of the PLL algorithm this approximation is pretty accurate, since Akiba et al. [5] have shown
that the "size of the created labels does not differ much for different vertices and few vertices have much larger
labels than the average". Since we have $D \cdot |V|$ objects and $|HL|/|V|$ hubs per object, the backward
labels-to-many will have on average $D \cdot |HL|$ pairs. Regarding the offline phase, the kNN backward labels
construction needs to access all those $D \cdot |HL|$ pairs (same as the backward labels-to-many) to construct
the kNN backward labels that have a maximum of k + 1 pairs per hub. In the batch kNN calculations,
we have a total of $D \cdot |V|$ kNN queries, each of which needs to access on average $(k + 1) \cdot |HL|/|V|$ pairs to create
the kNN results of size $k \cdot D \cdot |V|$. Therefore, the complexity of the batch kNN calculations will be
$(k + 1) \cdot D \cdot |HL|$. Finally, for the RkNN backward labels construction we need to access $D \cdot |HL|$ pairs
(same as the backward labels-to-many) and the $k \cdot D \cdot |V|$ results (to retrieve the worst, i.e., k-th, label per object).
In conclusion, both the kNN and RkNN backward labels constructions have a complexity of $D \cdot |HL|$ each
(since $|HL| \gg |V|$), whereas the most costly batch kNN calculations stage has a complexity of $(k + 1) \cdot D \cdot |HL|$.

Regarding the online phase, for very large values of k, the online phase of ReHub will degrade
to a one-to-many query between the query vertex q and the set of objects P. Therefore, we will first
analyze the complexity of a one-to-many HL query. As shown earlier, the backward labels-to-many
will have on average $D \cdot |HL|$ pairs. If those pairs are equally distributed per hub, then each hub of the
backward labels-to-many will have an average of $D \cdot |HL|/|V|$ pairs. Since the forward label of the query
vertex q will have an average of $|HL|/|V|$ hubs, a one-to-many query from the query vertex will access on
average $D \cdot (|HL|/|V|)^2$ pairs. Thus, the online phase of ReHub will access $s \cdot D \cdot (|HL|/|V|)^2$ pairs, where $s \le 1$ (since
the size of the RkNN backward labels is smaller than that of the backward labels-to-many) and $s = f(k, D)$,
i.e., the value of s depends on the density D and the cardinality k of the RkNN results. In fact, our
experiments showed that s becomes smaller for larger values of D and smaller values of k.
The aforementioned theoretical results are summarized in Table 5, where we also report the memory
required for storing the results of each stage, considering the fact that each pair requires 5 bytes for
storage (4 bytes for the object index + 1 byte for the distance) and the result of the online phase is a
vector of distances of size $D \cdot |V|$.

Our theoretical evaluation shows that even for large values of k, where the online phase of ReHub
would converge to a one-to-many query, ReHub's online performance will remain excellent, as long
as the fraction $|HL|/|V|$ is relatively small, i.e., the number of created labels is proportionate to the number
of graph vertices. As our experimentation showed (see Section 4), this fraction was below 5,000 for
all network graphs we experimented with.
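
As a rough worked example of these estimates, the sketch below plugs numbers into the formulas above. The values of |V| and |HL|/|V| are those reported for the DBLP graph in Table 6; the density D = 0.01, the choice k = 1 and the pessimistic setting s = 1 are assumptions made purely for illustration, and the function name `rehub_estimates` is hypothetical.

```python
def rehub_estimates(num_vertices, labels_per_vertex, density, k, s=1.0):
    """Average-case pair counts and result sizes from the formulas of this section."""
    hl = labels_per_vertex * num_vertices  # |HL|, total number of label pairs
    return {
        "objects": density * num_vertices,                      # D * |V|
        "knn_labels_construction": density * hl,                # D * |HL|
        "batch_knn": (k + 1) * density * hl,                    # (k + 1) * D * |HL|
        "rknn_labels_construction": density * hl,               # D * |HL|
        "online_pairs": s * density * labels_per_vertex ** 2,   # s * D * (|HL|/|V|)^2
        "knn_results_bytes": 5 * k * density * num_vertices,    # 5 * k * D * |V|
    }


# DBLP row of Table 6: |V| = 540,486 and |HL|/|V| = 3,628.
# D = 0.01, k = 1 and s = 1 (upper bound) are illustrative assumptions.
estimates = rehub_estimates(540_486, 3_628, density=0.01, k=1)
for name, value in estimates.items():
    print(f"{name}: {value:,.2f}")
```

Under these assumptions the offline stages touch on the order of tens of millions of pairs, while a single online query touches on the order of a hundred thousand pairs at most, which illustrates why the offline/online split pays off.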


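Table 5: ReHub complexity and memory requirements

Stage                              Complexity                       Memory for storing result (B)
---------------------------------  -------------------------------  -----------------------------
kNN backward labels construction   $D \cdot |HL|$                   $5 \cdot (k+1) \cdot |V|$
Batch kNN calculations             $(k+1) \cdot D \cdot |HL|$       $5 \cdot k \cdot D \cdot |V|$
RkNN backward labels construction  $D \cdot |HL|$                   $5 \cdot s \cdot D \cdot |HL|$
Online phase                       $s \cdot D \cdot (|HL|/|V|)^2$   $D \cdot |V|$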


Table 6: Network graph statistics

Graph         |V|        |E|         Avg degr.  |HL|/|V|  PLL Preproc. Time (s)
------------  ---------  ----------  ---------  --------  ---------------------
Facebook      4,039      88,234      22         26        0.03
NotreDame     325,729    1,090,108   3          55        6
Gowalla       196,591    950,327     5          100       13
Youtube       1,134,890  2,987,624   3          167       123
Slashdot0811  77,360     469,180     6          204       11
Slashdot0922  82,168     504,230     6          216       13
Citeseer1     268,495    1,156,647   4          408       110
Amazon        334,863    925,872     3          689       230
DBLP          540,486    15,245,729  28         3,628     5,720
Citeseer2     434,102    16,036,720  37         4,457     5,946


3.4 Extension to Directed and Weighted Graphs

Throughout this work and the experimentation described in Section 4, we use undirected and
unweighted graphs. However, the ReHub algorithm may be easily extended to directed graphs with the
following changes: (i) in the offline phase, the kNN backward labels must be constructed from the
backward labels; (ii) in the online phase, we must use the backward labels of the query vertex q.
Note that most previous methods like [32] \TW have only been applied to undirected networks. For
weighted graphs, ReHub will work without requiring any further modifications.
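
The direction of the labels only matters on directed graphs. As a minimal illustration (not the PLL construction used in this work), the Python sketch below builds trivial forward and backward labels by BFS, in which every reachable vertex serves as a hub, and answers distance queries by intersecting them. All function names are hypothetical. Under our reading of the changes listed above, a reverse query needs the distance d(p, q) from an object p to the query vertex q, which is why it reads the forward label of p and the backward label of q.

```python
from collections import defaultdict, deque


def bfs(adj, src):
    """Unweighted single-source distances from src following adj."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def build_trivial_labels(adj, vertices):
    """Forward label of v: {h: d(v, h)}; backward label of v: {h: d(h, v)}.

    Every reachable vertex is a hub, so the labels are valid but far from
    minimal. A real system would use a labeling algorithm such as PLL.
    """
    rev = defaultdict(list)
    for u, outs in adj.items():
        for w in outs:
            rev[w].append(u)
    forward = {v: bfs(adj, v) for v in vertices}
    backward = {v: bfs(rev, v) for v in vertices}  # BFS on reversed arcs
    return forward, backward


def hl_distance(forward, backward, s, t):
    """d(s, t) = min over common hubs h of d(s, h) + d(h, t)."""
    fs, bt = forward[s], backward[t]
    common = fs.keys() & bt.keys()
    return min((fs[h] + bt[h] for h in common), default=float("inf"))


def distances_to_query(forward, backward, objects, q):
    """d(p, q) for every object p: forward label of p, backward label of q."""
    return {p: hl_distance(forward, backward, p, q) for p in objects}


if __name__ == "__main__":
    # Directed cycle 0 -> 1 -> 2 -> 3 -> 0: distances are asymmetric.
    cycle = {0: [1], 1: [2], 2: [3], 3: [0]}
    fwd, bwd = build_trivial_labels(cycle, list(cycle))
    print(hl_distance(fwd, bwd, 0, 3), hl_distance(fwd, bwd, 3, 0))
    print(distances_to_query(fwd, bwd, [1, 3], q=0))
```

On an undirected graph the forward and backward labels coincide, which is why the distinction does not appear elsewhere in this work.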

4 Experiments 

To evaluate the performance of ReHub on various large-scale graphs, we conducted experiments on
a workstation with a 4-core Intel i7-4771 processor clocked at 3.5 GHz and 32 GB of RAM, running
Ubuntu 14.04. Our code was written in C++ and compiled with GCC 4.8 at optimization level 3. We used
OpenMP for parallelization.

The network graphs used in our experiments were taken from the Stanford Large Network
Dataset Collection [26] and the 10th DIMACS Implementation Challenge website [18]. All graphs are
undirected, unweighted and strongly connected. We used collaboration graphs (DBLP, Citeseer1,
Citeseer2) [23], social networks (Facebook [29], Slashdot1 and Slashdot2 [27]), networks with
ground-truth communities (Amazon, Youtube) [31], web graphs (Notre Dame) [7] and location-based social
networks (Gowalla) [12]. The graphs' average degree is between 3 and 37 and the PLL algorithm
creates 26 - 4,457 labels per vertex, requiring 0.03 - 5,950 s for the hub labels' construction (see Table 6).
For each individual RkNN experiment, we randomly generate 100 sets of objects of size |P| and then
generate 100 random query points per set. RkNN query times are then averaged over those 10,000
experiments.
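
The measurement protocol described above can be sketched in Python as follows. This is a hedged illustration rather than our benchmarking code: `run_workload` and `query_fn` are hypothetical names, the random seed is arbitrary, and we assume that query points are drawn uniformly from all vertices and that only the online query is timed.

```python
import random
import time


def run_workload(vertices, num_objects, query_fn,
                 num_sets=100, queries_per_set=100, seed=0):
    """Average query time over num_sets * queries_per_set random queries.

    vertices   : list of vertex ids
    num_objects: |P|, the size of each random object set
    query_fn   : callable(query_vertex, objects), the query under test
    """
    rng = random.Random(seed)
    total_seconds, count = 0.0, 0
    for _ in range(num_sets):
        objects = rng.sample(vertices, num_objects)
        # Any per-object-set preprocessing would run here, outside the timer.
        for _ in range(queries_per_set):
            q = rng.choice(vertices)
            start = time.perf_counter()
            query_fn(q, objects)
            total_seconds += time.perf_counter() - start
            count += 1
    return total_seconds / count, count


if __name__ == "__main__":
    # Placeholder query that does nothing, just to exercise the harness.
    avg, n = run_workload(list(range(10_000)), num_objects=100,
                          query_fn=lambda q, objs: None)
    print(f"{n} queries, average {avg * 1e6:.2f} microseconds")
```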

Figure 2: Offline and online phases of ReHub for k = 1 and varying values of D


4.1 Performance and Memory Requirements

In our first round of experiments, we evaluate the performance of ReHub with respect to the object
density D = |P|/|V|. Figure 2 reports the time required for the offline and online phases of ReHub
for k = 1 and D ∈ {0.001, 0.005, 0.01, 0.05, 0.1}, similar to the methodology followed in [32]. For the
offline phase, we parallelized only the slower substage of batch kNN computations from objects. The
online phase is always sequential.

We see that the offline phase of ReHub for most graphs takes less than 250 ms for all values of D.
Even for the worst-performing graphs (DBLP, Citeseer2) it takes less than 1 s, except for very dense
distributions of objects (D = 0.1). Considering that such a dense distribution of objects is not
common, and comparing the offline phase's time with the construction of the hub labels (Table 6),
even the time required for the offline phase could be considered negligible. Regarding the online
phase, results are even more impressive. For all graphs and all values of D, the online phase takes
less than 0.6 ms, except for the Youtube graph and D = 0.1 (1.1 ms). Generally speaking, both phases
perform worse for increasing values of D (and |P|) and for a larger number of labels per vertex (e.g., DBLP,
Citeseer2), as predicted by the theoretical analysis of the ReHub algorithm presented in Section 3.3.


In terms of memory requirements, Figure 3 reports the memory required for storing the additional
data structures for ReHub (kNN backward labels, kNN results per object and the RkNN backward
labels) and the size of the RkNN backward labels in comparison to the backward labels-to-many, for
the same setting as our previous experiment (i.e., for k = 1 and D ∈ {0.001, 0.005, 0.01, 0.05, 0.1}). Results
show that the memory required for the additional data structures for ReHub is less than 13 MB even for
the worst-performing graphs (DBLP, Citeseer2) and the size of the RkNN backward labels can be more
than 100x smaller than the size of the backward labels-to-many for very dense objects (D = 0.1), i.e., the
corresponding online phase there will consequently be 100x faster than a one-to-many query. This
shows that even for such large values of the density D (which constitutes the worst-case scenario for
ReHub) its online phase will still remain very fast and efficient.

In our second round of experiments, similar to [32], we assess the performance and memory
characteristics of the ReHub algorithm with respect to k. To this end, Figure 4 reports the time
required for the offline and online phases of the ReHub algorithm for D = 0.01 and k ∈ {1, 2, 4, 8, 16, 32}.
Again, for the offline phase, we parallelized only the batch kNN computations from objects.

As before, the offline phase of ReHub takes less than 250 ms for most graphs and values of k. Even
for the worst performing graphs (DBLP, Citeseer2) it takes less than l.D, except for k = 32 (Us). 

Figure 3: Memory footprint of ReHub for k = 1 and varying values of D



(a) Offline phase. Y-axis is on logarithmic scale and time is reported in ms
(b) Online phase. Y-axis is on logarithmic scale and time is reported in μs


Figure 4: Offline and online phases of ReHub for D = 0.01 and varying values of k 


The online phase takes less than 1.1 ms, except for the Citeseer2 and DBLP graphs at k = 32 (2.4 ms and
2.1 ms, respectively). In conclusion, although ReHub's performance degrades for increasing values of k,
its performance remains excellent throughout. Interestingly enough, for the top-4 graphs in terms of
forward labels per vertex (Amazon, Citeseer1, Citeseer2, DBLP), the offline phase is 4.4-6x slower and the
online phase 8-14x slower for k = 32 than for k = 1. This further demonstrates how the reduced size of
the RkNN backward labels (which depends on k) improves ReHub's online phase performance, in comparison to
running a plain one-to-many HL query between the query point and the objects (which would be much
slower, even than the value observed for k = 32).

Regarding memory requirements, Figure 5 reports the memory required for storing the additional
data structures for ReHub (kNN backward labels, kNN results per object and RkNN backward labels)
and the size of the RkNN backward labels in comparison to the backward labels-to-many, with respect
to varying values of k (again for D = 0.01). Results show that the memory required for the additional
data structures for ReHub is less than 50 MB even for k = 32 and the worst-performing graphs, whereas
the size of the RkNN backward labels can be more than 3x smaller than the size of the backward
labels-to-many for the same value of k, i.e., the corresponding online phase will consequently be 3x
faster than a one-to-many query, even for k = 32. This shows that even for large values of k, ReHub's
online phase will still be very fast and efficient. Moreover, our experimental results are entirely consistent
with the theoretical analysis of ReHub provided in Section 3.3 and show that the variable s introduced there
gets progressively smaller for larger values of D and smaller values of k.

Figure 5: Memory footprint of ReHub for D = 0.01 and varying values of k

Figure 6: Offline and online phases of ReHub for k = 1, D = 0.01 and varying values of B

In our third round of experiments, we evaluate the impact of the object distribution on ReHub's
performance. To that purpose, we adopt a methodology similar to IITTII . We pick a vertex at random
and run BFS from it until reaching a predetermined number of vertices |B|. If B is the set of vertices
visited during this search, we pick our objects P as a random subset of B. We keep the density of
objects steady at D = 0.01 and we experiment with different values of |B|, expressed as a percentage of
the total graph vertices.
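
The object selection procedure just described can be sketched in Python as follows. It is a minimal illustration with hypothetical function names; it assumes a connected graph given as an adjacency dictionary, and it enlarges the ball to the number of objects whenever rounding would make it smaller.

```python
import random
from collections import deque


def bfs_ball(adj, source, ball_size):
    """Return the first `ball_size` vertices visited by BFS from `source`."""
    visited = [source]
    seen = {source}
    queue = deque([source])
    while queue and len(visited) < ball_size:
        u = queue.popleft()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                visited.append(w)
                queue.append(w)
                if len(visited) == ball_size:
                    break
    return visited


def pick_clustered_objects(adj, density, ball_fraction, rng):
    """Pick round(density * |V|) objects at random inside a BFS ball.

    ball_fraction is |B| / |V|; ball_fraction = 1 gives uniformly
    distributed objects on a connected graph.
    """
    n = len(adj)
    num_objects = max(1, round(density * n))
    ball_size = max(num_objects, round(ball_fraction * n))
    source = rng.choice(list(adj))
    ball = bfs_ball(adj, source, ball_size)
    return rng.sample(ball, min(num_objects, len(ball)))


if __name__ == "__main__":
    # Ring of 1,000 vertices as a stand-in for a real network.
    ring = {i: [(i - 1) % 1000, (i + 1) % 1000] for i in range(1000)}
    objects = pick_clustered_objects(ring, density=0.01, ball_fraction=0.05,
                                     rng=random.Random(42))
    print(sorted(objects))
```

With `ball_fraction` equal to `density`, the ball and the object set coincide, which is the most concentrated setting of this experiment.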

Again, we see that ReHub provides excellent performance both for the online and the offline phase,
regardless of the objects' distribution. In fact, ReHub's performance is even better when the objects
are more concentrated within the graph (e.g., for B = 0.01) than when they are randomly distributed
(B = 1). As a result, the offline phase is 1.6-2.3x faster for B = 0.01 than for B = 1 and the online phase
is 1.5-3.8x faster for B = 0.01 than for B = 1. This great improvement, especially in the online phase, is
attributed to the fact that the RkNN backward labels are smaller, since objects are closer to each other,
which facilitates faster online queries. This fact further testifies to the robustness of ReHub, even for
more skewed distributions of objects. This is also evident in Figure 7, which shows that the memory
requirements of ReHub are smaller for smaller values of B and that the size difference between the RkNN
backward labels and the backward labels-to-many is also amplified for smaller values of B.

Figure 7: Memory footprint of ReHub for k = 1, D = 0.01 and varying values of B

4.2 Comparison with Previous Works

Our experimentation has shown that ReHub exhibits excellent query performance and requires very small
additional memory for all tested networks, regardless of the object density, the number k of RkNN
neighbors or the distribution of objects. In comparison to previous works, ReHub may handle networks
that are two orders of magnitude larger and denser than those of [32], [30], \TW, and scales easily to
k = 32, whereas previous secondary-storage methods have only been tested for up to k = 8 [32], [13] or
k = 20 [30]. But even then, e.g., for k = 8 those methods require more than 300 ms [10], whereas for
similarly small networks (e.g., Gowalla) ReHub's offline phase requires < 20 ms and the online phase
< 0.02 ms. Even for larger networks, the online phase typically requires less than 1 ms,
i.e., ReHub is at least 3 orders of magnitude faster. Thus, ReHub is the most complete solution for RkNN
queries on social and collaboration networks and the best overall contender for real-time applications.
In addition, Efentakis et al. IfT^ have shown how the online phase of ReHub may be translated to a
simple SQL query on an open-source database engine, making ReHub the only RkNN solution that may
also be used in a pure SQL context, for even greater versatility and scalability.

Moreover, we showed that ReHub can handle networks where the size of the created labels is more
than three thousand hubs per vertex (e.g., DBLP, Citeseer2). Hence, the proposed algorithm will be
even more efficient and faster when applied to sparser graph classes such as road networks, where the
size of the created labels is less than a few hundred hubs per vertex, even for worse-behaving metrics
(e.g., travel distances).

5 Conclusion and Future Work 


This work introduced ReHub, a novel main-memory algorithm that extends the hub labeling method
to efficiently handle RkNN queries on large-scale graphs. Our experimentation showed that ReHub
provides excellent query performance, has minimal memory requirements, and scales very well with
the network size, the object density, the object distribution, the size of the labels, and the cardinality of
the reverse k-nearest neighbor result. Given these results, ReHub is the best overall and most complete
solution for this type of query and may also be used in the context of real-time applications.

One direction for future work is to extend ReHub towards handling object updates, i.e., objects may
be added to or deleted from the object set. Not having to redo the offline phase from scratch for such
updates will significantly increase the practical applicability of the algorithm. In addition, testing our
approach on directed graphs and road networks will further showcase the algorithm's performance with
respect to a wider range of graph classes, additional hub labeling algorithms, and domains.